Titanic Data Analysis By Gangadhara Naga Sai

Overview Of Titanic Dataset

In 1912, the RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. In this project, we explore the RMS Titanic passenger manifest to see which factors are associated with whether a passenger survived. The dataset contains demographics and passenger information for 891 of the 2224 passengers and crew on board the Titanic and is obtained from Kaggle (https://www.kaggle.com/c/titanic/data).

Questions

Do a passenger's chances of survival differ depending on the following characteristics?
- Age
- Gender
- Passenger class

Data Wrangling



In [1]:

    
import numpy as np
import pandas as pd
from IPython.display import display

%matplotlib inline

# Load the dataset
files = 'titanic_data.csv'
data_titanic = pd.read_csv(files)
display(data_titanic.head())









    






  
    
      
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Data Description

From a sample of the RMS Titanic data, we can see the various features present for each passenger on the ship:

Survived: Outcome of survival (0 = No; 1 = Yes)
Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
Name: Name of passenger
Sex: Sex of the passenger
Age: Age of the passenger (Some entries contain NaN)
SibSp: Number of siblings and spouses of the passenger aboard
Parch: Number of parents and children of the passenger aboard
Ticket: Ticket number of the passenger
Fare: Fare paid by the passenger
Cabin: Cabin number of the passenger (Some entries contain NaN)
Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Variable Notes

pclass: A proxy for socio-economic status (SES)

1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.



In [2]:

    
# Use a shorter name for the loaded DataFrame (this is an alias, not a copy)
data = data_titanic

# Show the dataset 
display(data.head())
data.info()









    






  
    
      
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
    
  








    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

From the info() output above, we can see that the columns Age, Cabin and Embarked have missing values.

Handling the missing values:

There are three options: ignore the rows with missing data, exclude the variable altogether, or substitute the missing values with the mean or median.

Age: about 80% of the values are available, and it seems to be an important variable, so it is not excluded.

Embarked: the port of embarkation does not seem interesting for this analysis, so it is excluded.

Cabin: only about 23% of the values are available, so it is excluded.

PassengerId, Name, Ticket and Fare do not seem to contribute to the survival investigation, so they are excluded as well.
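
The availability figures quoted above follow from the non-null counts in the info() output. A minimal sketch of that arithmetic, using only the counts printed above (the names `non_null` and `available` are chosen just for this illustration):

```python
# Non-null counts copied from the data.info() output above
total_rows = 891
non_null = {'Age': 714, 'Cabin': 204, 'Embarked': 889}

# Share of available (non-missing) values per column, in percent
available = {col: 100.0 * n / total_rows for col, n in non_null.items()}

for col in sorted(available):
    print("{}: {:.1f}% available".format(col, available[col]))
```

On the loaded DataFrame itself, the same shares could probably be obtained in one step with `data.notnull().mean()`.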



In [3]:

    
# Exclude the columns that are not used in the analysis
del data['Ticket']
del data['Cabin']
del data['Embarked']
del data['Name']
del data['PassengerId']
del data['Fare']



In [4]:

    
data.describe(include='all')



In [5]:

    
# Calculate number of missing values
data.isnull().sum()









    Out[5]:





Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
dtype: int64



In [6]:

    
null_female = data[pd.isnull(data['Age'])]['Sex'] == 'female'
null_male = data[pd.isnull(data['Age'])]['Sex'] == 'male'

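# print() with a single formatted string behaves the same in Python 2 and 3
print("Total missing age for female: {}".format(null_female.sum()))

print("Total missing age for male: {}".format(null_male.sum()))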









    



Total missing age for female: 53
Total missing age for male: 124

Let's decide whether to remove the rows with a missing age or to fill in the missing values. I first split the data into two samples, one with a missing age and one with a known age, and perform a t-test on their survival rates.



In [7]:

    
notnull_age = data[pd.notnull(data['Age'])]
null_age = data[pd.isnull(data['Age'])]

Hypothesis

To decide how to handle the missing ages, I use a t-test to check whether passengers in these two samples are likely to have a similar survival rate.

H0: the mean survival rates of the two sample populations are equal
H1: the population means are different

If the resulting p-value is less than the significance level (alpha = 0.05), I reject the null hypothesis and conclude that the population means differ by more than chance. In that case the rows with a missing age, almost 20% of the data, should not simply be ignored.

I use the existing scipy.stats function ttest_ind to perform the t-test for independent samples:



In [8]:

    
from scipy.stats import ttest_ind
ttest_ind(notnull_age['Survived'], null_age['Survived'])









    Out[8]:





Ttest_indResult(statistic=2.7606993230995345, pvalue=0.0058865348400755626)

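The p-value (about 0.006) is less than 0.05, so H0 is rejected: the survival rates of passengers with and without a recorded age differ significantly. Dropping the rows with a missing age could therefore bias the analysis, so I substitute the missing values instead, using the median age of passengers of the same sex and class.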
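
To make the test statistic less of a black box, here is a minimal sketch of the pooled-variance two-sample t statistic, which is what ttest_ind computes with its default equal-variance setting. The helper name `pooled_t_statistic` and the two short 0/1 lists are invented for illustration and are not the Titanic data; the p-value step is left to scipy.

```python
import math

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance (equal variances assumed)."""
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / float(n_a)
    mean_b = sum(b) / float(n_b)
    # Sums of squared deviations from each sample mean
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    dof = n_a + n_b - 2
    pooled_var = (ss_a + ss_b) / dof
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_err, dof

# Toy 0/1 survival flags, invented purely for illustration
with_age = [1, 1, 0, 1, 0, 1, 1, 0]
without_age = [0, 0, 1, 0, 0, 0]

t_stat, dof = pooled_t_statistic(with_age, without_age)
print("t = {:.3f} with {} degrees of freedom".format(t_stat, dof))
```

Because Survived is coded as 0/1, the sample means compared here are simply the survival rates of the two groups.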



In [9]:

    
print("Age median values by Sex and Pclass:")
# Group by sex and class and take the median age, so that each NaN can be replaced with the median of its own group
print(data.groupby(['Sex','Pclass'], as_index=False)['Age'].median())
print("Age values for the first 5 persons with a missing age:")
print(data.loc[data['Age'].isnull(),['Age','Sex','Pclass']].head(5))
# Apply the transformation: missing Age values are filled with the median of the matching Sex/Pclass group
data['Age'] = data.groupby(['Sex','Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
print(data.loc[[5,17,19,26,28],['Age','Sex','Pclass']].head(5))
# Fallback: fill any Age that is still missing (e.g. a group with no known ages) with the overall mean
data['Age'] = data['Age'].fillna(data['Age'].mean())









    



Age median values by Sex and Pclass:
      Sex  Pclass   Age
0  female       1  35.0
1  female       2  28.0
2  female       3  21.5
3    male       1  40.0
4    male       2  30.0
5    male       3  25.0
Age values for the first 5 persons with a missing age:
    Age     Sex  Pclass
5   NaN    male       3
17  NaN    male       2
19  NaN  female       3
26  NaN    male       3
28  NaN  female       3
     Age     Sex  Pclass
5   25.0    male       3
17  30.0    male       2
19  21.5  female       3
26  25.0    male       3
28  21.5  female       3
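
The group-wise fill used above can be tried in isolation on a tiny frame. This is a sketch under stated assumptions: the five rows in `toy` are invented for illustration, and only the groupby/transform pattern mirrors the notebook.

```python
import pandas as pd

# Toy data invented for illustration: two Sex/Pclass groups, one missing age in each
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female'],
    'Pclass': [3, 3, 3, 1, 1],
    'Age': [20.0, 30.0, None, 40.0, None],
})

# Replace each missing Age with the median Age of its own Sex/Pclass group
toy['Age'] = toy.groupby(['Sex', 'Pclass'])['Age'].transform(
    lambda s: s.fillna(s.median()))

print(toy)
```

The missing male third-class age becomes 25.0 (the median of 20 and 30) and the missing female first-class age becomes 40.0, while known ages are left untouched.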



In [10]:

    
data.describe(include='all')

We can see that all columns now have the same count (891), so no missing values remain.

Data Exploration and Visualization



In [11]:

    
data_s=data
survival_group = data_s.groupby('Survived')
survival_group.describe()

From the above statistics

Youngest to survive: 0.42
Youngest to die: 1.0
Oldest to survive: 80.0
Oldest to die: 74.0



In [12]:

    
# Check why the minimum age is fractional (0.42): list the passengers younger than 1 year
data_s[data_s['Age'] < 1]

These must be infants under one year old, and all of them survived.



In [13]:

    
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for all graphs
#sns.set_style("light")
#sns.set_style("whitegrid")
sns.set_style("ticks", {"xtick.major.size": 8, "ytick.major.size": 8})



In [14]:

    
def plot(a,i):
    fig=plt.figure() #Plots in matplotlib reside within a figure object, use plt.figure to create new figure
    #Create one or more subplots using add_subplot, because you can't create blank figure
    ax = fig.add_subplot(1,1,1)
    #Variable
    ax.hist(data[a],bins = i) # Here you can play with number of bins
    plt.title(a + ' distribution')
    plt.xlabel(a)
    plt.ylabel('Passenger Count')
    plt.show()



In [15]:

    
plot("Age",30)
print("The above distribution of Age seems a little bit deviating from normal distribution")
print("")
plot("SibSp",8)
print("The above distribution of SibSp seems a right-skewed distribution")
plot("Parch",6)
print("The above distribution of Parch seems a right-skewed distribution")









    












    



The above distribution of Age seems a little bit deviating from normal distribution







    












    



The above distribution of SibSp seems a right-skewed distribution






    












    



The above distribution of Parch seems a right-skewed distribution



In [16]:

    
sns.factorplot(x="Sex", y="Age", data=data_s, kind="box", size=7, aspect=.8)\
.set_xticklabels(["Male","Female"])
plt.title('Boxplot of Age grouped by sex')
print("From the below plot we can see there are more elderly men than women and average age for men is higher than women")









    



From the below plot we can see there are more elderly men than women and average age for men is higher than women

Gender played an important role in the survival of each individual, as the survival rates by sex computed further below show:

Female survival rate: 74.2%

Male survival rate: 18.9%



In [17]:

    
sns.factorplot(x="Pclass", y="Age", data=data_s, kind="box", size=7, aspect=.8)\
.set_xticklabels(["1","2","3"])
plt.title('Boxplot of Age grouped by class')
print("From the below plot we can see the average age is decreasing from class 1 to class 3")









    



From the below plot we can see the average age is decreasing from class 1 to class 3

From the above plot we can clearly see how the ages of individuals are distributed within each class. The line inside each box shows the median age for that class.



In [18]:

    
sns.factorplot( 'Sex' , 'Survived', data = data, kind = 'bar')
plt.title('Bar plot of Survival rate grouped by Sex')
print("From the plot we can clearly see the survival rate of female is very high")









    



From the plot we can clearly see the survival rate of female is very high



In [19]:

    
## GENDER
survivals = pd.crosstab([ data_s.Sex], data_s.Survived.astype(bool))
survivals.plot(kind='bar', stacked=False)
plt.ylabel("Passenger count")
plt.title('Bar plot of Passenger count grouped by Sex and survived')

survival = data_s.groupby('Sex')['Survived']
survival.mean()









    Out[19]:





Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
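
The reason `survival.mean()` yields a survival rate is that Survived is coded as 0/1, so the mean of the column within a group is the fraction of survivors in that group. A small sketch on invented toy data (the frame `toy` is an assumption, not part of the Titanic dataset):

```python
import pandas as pd

# Toy data invented for illustration: 1 = survived, 0 = did not survive
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'female',
            'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = share of survivors in that group
rates = toy.groupby('Sex')['Survived'].mean()
print(rates)
```

Three of the four toy women and one of the four toy men survive, so the group means come out as 0.75 and 0.25.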



In [20]:

    
#PCLASS

survivals = pd.crosstab([data_s.Pclass], data_s.Survived.astype(bool))
survivals.plot(kind='bar', stacked=True)
plt.ylabel("Passenger count")
plt.title('Bar plot of Passenger count grouped by Class')

survival=data.groupby(['Pclass'])
survival.mean()

A passenger in Class 1 was about 2.6 times as likely to survive as a passenger in Class 3.

Socio-economic standing was a factor in the survival rate of passengers.

Class 1: 62.96%
Class 2: 47.28%
Class 3: 24.24%
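
The "times as likely" comparison is just a ratio of the survival rates listed above. A minimal sketch of that arithmetic, using the three percentages as given (the variable names are chosen for illustration):

```python
# Survival rates by class, copied from the percentages listed above
survival_rate = {1: 0.6296, 2: 0.4728, 3: 0.2424}

# How many times as likely to survive, relative to Class 3
ratio_1_vs_3 = survival_rate[1] / survival_rate[3]
ratio_2_vs_3 = survival_rate[2] / survival_rate[3]

print("Class 1 vs Class 3: {:.2f}x".format(ratio_1_vs_3))
print("Class 2 vs Class 3: {:.2f}x".format(ratio_2_vs_3))
```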



In [30]:

    
survivals = pd.crosstab([data_s.Pclass, data_s.Sex], data_s.Survived.astype(bool))
survivals.plot(kind='bar', stacked=True)
survive=data.groupby(['Sex','Pclass'])
plt.ylabel("Passenger count")
plt.title('Bar plot of passenger count grouped by sex and Class')

#survive.Survived.sum().plot(kind='barh')
survive.mean()

From the above plot we can see that women were given priority, and that within each sex the survival rate also depended on class.

Socio-economic standing was a factor in the survival rate of passengers within each gender.

Class 1 - female survival rate: 96.81%
Class 1 - male survival rate: 36.89%
Class 2 - female survival rate: 92.11%
Class 2 - male survival rate: 15.74%
Class 3 - female survival rate: 50.0%
Class 3 - male survival rate: 13.54%
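
One way to read these six rates together is to compare women with men inside each class. The sketch below only divides the rates listed above; the dictionary layout and names are illustrative.

```python
# Survival rates by class and sex, copied from the list above
rates = {
    1: {'female': 0.9681, 'male': 0.3689},
    2: {'female': 0.9211, 'male': 0.1574},
    3: {'female': 0.5000, 'male': 0.1354},
}

# Female-to-male survival ratio within each class
ratios = {pclass: r['female'] / r['male'] for pclass, r in rates.items()}

for pclass in sorted(ratios):
    print("Class {}: women survived {:.1f}x as often as men".format(
        pclass, ratios[pclass]))
```

The gap between the sexes is present in every class and is widest in the second class.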



In [23]:

    
#Age
sns.factorplot(x="Survived", y="Age", hue='Sex', data=data_s, kind="box", size=7, aspect=.8)\
.set_xticklabels(["Expired","Survived"])
plt.title('Boxplot of Age grouped by sex and Survival')
# survive_A=data.groupby(['Sex','Age'])









    Out[23]:





<matplotlib.text.Text at 0xd8ae470>

From the above boxplot and the calculated means:

Irrespective of sex and class, age alone was not a deciding factor in the passenger survival rate
The average ages of those who survived and those who did not seem almost the same in the boxplot



In [32]:

    
#Age
# We are dividing the Age data into 3 buckets: (0, 18], (18, 40], (40, 90]
# (each bucket includes its upper edge) and labeling them as 'Childs','Adults','Seniors' respectively
data['group_age']  = pd.cut(data['Age'], bins=[0,18,40,90], labels=['Childs','Adults','Seniors'])

data.head(5)
    
survive_a=data.groupby(['group_age'])
survival_a = pd.crosstab([data.group_age], data_s.Survived.astype(bool))
survival_a.plot(kind='bar', stacked=True)
plt.title('Bar plot of Passenger count grouped by age categories ')
plt.ylabel("Passenger count")

# sns.factorplot(x="group_age", y="Age", hue='Sex', data=data, kind="box", size=7, aspect=.8)

survive_a.mean()

These are the survival percentages for each age group:

Adult: 36.05%
Child: 50.36%
Senior: 36.67%
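
The bucketing behind these groups relies on pd.cut, whose intervals are closed on the right by default, so an 18-year-old lands in 'Childs' and a 40-year-old in 'Adults'. A short sketch with a few boundary ages (the list `ages` is invented for illustration):

```python
import pandas as pd

# Boundary ages chosen to show which bucket each edge value falls into
ages = [18, 18.5, 40, 41]
groups = pd.cut(ages, bins=[0, 18, 40, 90],
                labels=['Childs', 'Adults', 'Seniors'])

for age, label in zip(ages, groups):
    print("{} -> {}".format(age, label))
```

Note that this differs slightly from the Category column defined later in the notebook, where a child is anyone strictly younger than 18.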

Were women and children given priority for the lifeboats?



In [25]:

    
def group(d,v):
    if (d == 'female') and (v >= 18):
        return 'Woman'
    elif v < 18:
        return 'child'                        
    elif (d == 'male') and (v >= 18): 
        return  'Man'

data['Category'] = data.apply(lambda row:group(row['Sex'], row['Age']), axis=1) 
data.head(5)









    Out[25]:






  
    
      
      Survived
      Pclass
      Sex
      Age
      SibSp
      Parch
      group_age
      Category
    
  
  
    
      0
      0
      3
      male
      22.0
      1
      0
      Adults
      Man
    
    
      1
      1
      1
      female
      38.0
      1
      0
      Adults
      Woman
    
    
      2
      1
      3
      female
      26.0
      0
      0
      Adults
      Woman
    
    
      3
      1
      1
      female
      35.0
      1
      0
      Adults
      Woman
    
    
      4
      0
      3
      male
      35.0
      0
      0
      Adults
      Man



In [26]:

    
survival_a = pd.crosstab([data.Category], data_s.Survived.astype(bool))
survival_a.plot(kind='bar', stacked=True)
plt.ylabel("Passenger count")
plt.title('Survival by Age category')
data.groupby(['Category']).mean()["Survived"]









    Out[26]:





Category
Man      0.165703
Woman    0.752896
child    0.539823
Name: Survived, dtype: float64

Women and children were given priority, which shows in the survival rates:

Man: 16.57%
Woman: 75.29%
Child: 53.98%
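
The categories above depend on every passenger having an age. The sketch below restates the notebook's group() logic under the hypothetical name `categorize` and shows that a missing age (NaN) would fall through both comparisons and yield no category, which is one reason the ages were filled in first.

```python
def categorize(sex, age):
    """Hypothetical restatement of the notebook's group() helper."""
    if age < 18:
        return 'child'
    if age >= 18:
        if sex == 'female':
            return 'Woman'
        if sex == 'male':
            return 'Man'
    # Reached when age is NaN: both comparisons above are False
    return None

print(categorize('female', 30))
print(categorize('male', 5))
print(categorize('male', float('nan')))
```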



In [27]:

    
g = sns.factorplot(x="Category", y="Survived", col="Pclass", data=data, 
                   saturation=.5, kind="bar", ci=None, size=5, aspect=.8)

# Fix up the labels
(g.set_axis_labels('', 'Survival Rate')
     .set_xticklabels(["Men", "Women","child"])
     .set_titles("Class {col_name}")
     .set(ylim=(0, 1))
     .despine(left=True, bottom=True))
print('Histogram of Survival rate grouped by Age Category and Class:')









    



Histogram of Survival rate grouped by Age Category and Class:



In [28]:

    
# We are dividing the Age data into 3 buckets: (0, 18], (18, 40], (40, 90]
# (each bucket includes its upper edge) and labeling them as 'Childs','Adults','Seniors' respectively
data['group_age']  = pd.cut(data['Age'], bins=[0,18,40,90], labels=['Childs','Adults','Seniors'])

#finding mean Survival rate grouped by 'group_age','Sex'.
df = data.groupby(['group_age','Sex'],as_index=False).mean().loc[:,['group_age','Sex','Survived']]

f, (ax1, ax2,ax3) = plt.subplots(1, 3,figsize=(15,7))
g = sns.barplot(x="group_age", y="Survived", hue="Sex", data=df,ax=ax1)
ax1.set_title('Survival by Age and Sex')

#finding mean Survival rate grouped by 'group_age'.
data2 = data.groupby(['group_age'],as_index=False).mean().loc[:,['group_age','Survived']]

h = sns.barplot(x="group_age",y='Survived', data=data2,ax=ax2)
ax2.set_title('Survival by Age')

#counting the passengers in each 'group_age' (the count ends up in the 'Survived' column).
data3 = data.groupby(['group_age'],as_index=False).count().loc[:,['group_age','Survived']]

hh = sns.barplot(x="group_age",y='Survived', data=data3,ax=ax3)
ax3.set_title('Age distribution in Ship')
ax3.set_ylabel('Age Distribution')
for ax in f.axes:
    plt.sca(ax)
    plt.xticks(rotation=90)
plt.tight_layout()
plt.show()



In [29]:

    
data_C=data.groupby(['Category',"Pclass"]).mean()
data_C.sort_values("Survived")["Survived"]









    









    Out[29]:





Category  Pclass
Man       2         0.082474
          3         0.121711
          1         0.347458
child     3         0.371795
Woman     3         0.486239
          2         0.906250
child     2         0.913043
          1         0.916667
Woman     1         0.976744
Name: Survived, dtype: float64

From the above values we can see the survival rate increasing from top to bottom. From the plot we can also see how the survival rate is distributed among men, women and children within each class.
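
The same ranking can be read from highest to lowest by sorting in descending order. The sketch below rebuilds the nine values shown in the output above as a small Series (the construction by hand is only for illustration) and sorts it with sort_values, the replacement named in the deprecation warning.

```python
import pandas as pd

# Survival rates by Category and Pclass, copied from the output above
index = pd.MultiIndex.from_tuples(
    [('Man', 2), ('Man', 3), ('Man', 1), ('child', 3), ('Woman', 3),
     ('Woman', 2), ('child', 2), ('child', 1), ('Woman', 1)],
    names=['Category', 'Pclass'])
rates = pd.Series(
    [0.082474, 0.121711, 0.347458, 0.371795, 0.486239,
     0.906250, 0.913043, 0.916667, 0.976744],
    index=index, name='Survived')

# Highest survival rate first
ranked = rates.sort_values(ascending=False)
print(ranked)
```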

Conclusion

We observe the following order of survival rates based on age, sex and class, from highest to lowest:

children and women of the upper class
children and women of the middle class
women of the lower class
children of the lower class
men of the upper class
finally, men of the middle and lower classes, who have the lowest survival rate

The analysis suggests that women with upper socio-economic standing (first class) and children had the best chance of survival. Age alone did not seem to be a major factor. Men in the second and third classes had the lowest chance of survival. Women and children of all classes mostly had a higher survival rate than men.

Limitations:

Some men and women were missing Age data. These values were replaced with the median age of passengers of the same gender and class, which could have skewed the calculations that involve age.
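
The size of this effect can be gauged from the two describe() summaries of this notebook (Age count 714 before the fill, 891 after). A minimal sketch that only compares the figures reported there; the dictionary names are illustrative:

```python
# Age summary statistics copied from the two describe() outputs
before = {'count': 714, 'mean': 29.699118, 'std': 14.526497}
after = {'count': 891, 'mean': 29.112424, 'std': 13.304424}

# Number of imputed ages and the relative change in spread
filled = after['count'] - before['count']
std_change_pct = 100.0 * (after['std'] - before['std']) / before['std']
mean_change = after['mean'] - before['mean']

print("Imputed ages: {}".format(filled))
print("Change in mean age: {:.2f} years".format(mean_change))
print("Change in standard deviation: {:.1f}%".format(std_change_pct))
```

Filling 177 ages with group medians pulls values toward the centre of each group, which is consistent with the smaller standard deviation after imputation.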

Output of In [4], data.describe(include='all'), before the missing ages were filled:

	Survived	Pclass	Sex	Age	SibSp	Parch
count	891.000000	891.000000	891	714.000000	891.000000	891.000000
unique	NaN	NaN	2	NaN	NaN	NaN
top	NaN	NaN	male	NaN	NaN	NaN
freq	NaN	NaN	577	NaN	NaN	NaN
mean	0.383838	2.308642	NaN	29.699118	0.523008	0.381594
std	0.486592	0.836071	NaN	14.526497	1.102743	0.806057
min	0.000000	1.000000	NaN	0.420000	0.000000	0.000000
25%	0.000000	2.000000	NaN	20.125000	0.000000	0.000000
50%	0.000000	3.000000	NaN	28.000000	0.000000	0.000000
75%	1.000000	3.000000	NaN	38.000000	1.000000	0.000000
max	1.000000	3.000000	NaN	80.000000	8.000000	6.000000

Output of In [10], data.describe(include='all'), after the missing ages were filled:

	Survived	Pclass	Sex	Age	SibSp	Parch
count	891.000000	891.000000	891	891.000000	891.000000	891.000000
unique	NaN	NaN	2	NaN	NaN	NaN
top	NaN	NaN	male	NaN	NaN	NaN
freq	NaN	NaN	577	NaN	NaN	NaN
mean	0.383838	2.308642	NaN	29.112424	0.523008	0.381594
std	0.486592	0.836071	NaN	13.304424	1.102743	0.806057
min	0.000000	1.000000	NaN	0.420000	0.000000	0.000000
25%	0.000000	2.000000	NaN	21.500000	0.000000	0.000000
50%	0.000000	3.000000	NaN	26.000000	0.000000	0.000000
75%	1.000000	3.000000	NaN	36.000000	1.000000	0.000000
max	1.000000	3.000000	NaN	80.000000	8.000000	6.000000

Output of In [11], survival_group.describe():

		Age	Parch	Pclass	SibSp
Survived
0	count	549.000000	549.000000	549.000000	549.000000
	mean	29.737705	0.329690	2.531876	0.553734
	std	12.818264	0.823166	0.735805	1.288399
	min	1.000000	0.000000	1.000000	0.000000
	25%	22.000000	0.000000	2.000000	0.000000
	50%	25.000000	0.000000	3.000000	0.000000
	75%	37.000000	0.000000	3.000000	1.000000
	max	74.000000	6.000000	3.000000	8.000000
1	count	342.000000	342.000000	342.000000	342.000000
	mean	28.108684	0.464912	1.950292	0.473684
	std	14.010565	0.771712	0.863321	0.708688
	min	0.420000	0.000000	1.000000	0.000000
	25%	21.000000	0.000000	1.000000	0.000000
	50%	27.000000	0.000000	2.000000	0.000000
	75%	36.000000	1.000000	3.000000	1.000000
	max	80.000000	5.000000	3.000000	4.000000

Output of In [20], mean values grouped by Pclass:

	Survived	Age	SibSp	Parch
Pclass
1	0.629630	38.270463	0.416667	0.356481
2	0.472826	29.863207	0.402174	0.380435
3	0.242363	24.802281	0.615071	0.393075

Output of In [30], mean values grouped by Sex and Pclass:

		Survived	Age	SibSp	Parch
Sex	Pclass
female	1	0.968085	34.648936	0.553191	0.457447
	2	0.921053	28.703947	0.486842	0.605263
	3	0.500000	21.677083	0.895833	0.798611
male	1	0.368852	41.060820	0.311475	0.278689
	2	0.157407	30.678981	0.342593	0.222222
	3	0.135447	26.099193	0.498559	0.224784

Output of In [12], passengers younger than 1 year:

	Survived	Pclass	Sex	Age	SibSp	Parch
78	1	2	male	0.83	0	2
305	1	1	male	0.92	1	2
469	1	3	female	0.75	2	1
644	1	3	female	0.75	2	1
755	1	2	male	0.67	1	1
803	1	3	male	0.42	0	1
831	1	2	male	0.83	1	1

Output of In [32], mean values grouped by group_age:

	Survived	Pclass	Age	SibSp	Parch
group_age
Childs	0.503597	2.561151	10.717050	1.258993	0.935252
Adults	0.360465	2.387043	27.889535	0.403654	0.255814
Seniors	0.366667	1.760000	51.066667	0.320000	0.373333